Temporally extended successor feature neural episodic control

One of the long-term goals of reinforcement learning is to build intelligent agents capable of rapidly learning and flexibly transferring skills, similar to humans and animals. In this paper, we introduce an episodic control framework based on temporally extended successor features to achieve these goals, which we refer to as Temporally Extended Successor Feature Neural Episodic Control (TESFNEC). This method significantly improves sample efficiency and elegantly reuses previously learned policies. Crucially, the model enhances agent training by incorporating episodic memory, substantially reducing the number of iterations required to learn the optimal policy. Furthermore, we adopt the temporal extension of successor features as a technique to capture the expected state-transition dynamics of actions. This form of temporal abstraction does not entail learning a top-down hierarchy of task structures but focuses on the bottom-up combination of actions and action repetitions. Thus, our approach directly considers the temporal scope of sequences of temporally extended actions without requiring predefined or domain-specific options. Experimental results in a two-dimensional object collection environment demonstrate that the proposed method learns policies faster than baseline reinforcement learning approaches, leading to higher average returns.

Temporal abstraction allows behavior to be organized into higher-level units of tasks in a way that can reduce cognitive load and enhance generalization across tasks with shared structure 40–45. One formal approach to this form of abstraction is the options framework 39,46–50. Agents using options seek to learn a set of policies related to different subtasks, along with their initiation and termination conditions. While this formalizes the problem of temporal abstraction, the learning process can become more complex than learning simple value functions if no predefined, handcrafted domain-specific options are available 51–56.
In this paper, we introduce a new method that combines the flexibility of sample-efficient learning with the advantages of temporal abstraction using episodic control. Specifically, the model enhances agent training by incorporating episodic memory, facilitating the learning of superior policies. It leverages episodic memory to store historical high-reward experience and uses this information to guide agent training, significantly reducing the number of iterations required to learn the optimal policy. During training, the model can dynamically retrieve high-reward information from episodic memory and seamlessly integrate it into the neural network, improving sample efficiency. Furthermore, we adopt the temporal extension of successor features as a technique to capture the expected state-transition dynamics of actions 57–59. This is achieved by constructing successor features on top of repeated primitive actions. This form of action abstraction does not entail learning a top-down hierarchy of task structures but focuses on the bottom-up combination of actions and repeated actions. The bottom-up view rests on a natural idea: the solution to any subproblem depends only on the solutions to smaller subproblems, which can be stored in extra memory and reused, avoiding recomputation and improving performance. The subproblems are therefore ordered by size and solved iteratively from smallest to largest, so that when a particular subproblem is solved, all of the smaller subproblems it depends on have already been solved and stored. As such, our method directly considers the time horizon of temporally extended action sequences without requiring predefined or domain-specific options, and it reduces the number of decisions necessary to learn the optimal policy without hierarchical policy learning. Experimental results show that the proposed method learns policies faster than baseline reinforcement learning approaches, leading to higher average returns.

Preliminaries
In this section, we develop the foundations that will help us understand the concepts, techniques, and mathematical underpinnings essential to the contributions of this paper. In particular, we introduce the key definitions and notation of the RL problem, along with episodic control and successor features.

Reinforcement learning
The Reinforcement Learning (RL) problem is typically formalized as a Markov Decision Process (MDP), which is a key underlying assumption for much of this paper as well. A finite, discrete-time MDP is a six-tuple $(\mathcal{S}, \mathcal{A}, R, T, \gamma, \rho_0)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ denotes the state transition dynamics, $\rho_0$ is the initial state distribution, and $\gamma \in [0, 1]$ is a discount factor. The behavior of the agent is determined by its policy, denoted $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, a mapping from states to probabilities of taking each admissible primitive action. The value of being in a state is given by the state value function $V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$, defined as the expected cumulative return when starting from state $s$ and then following policy $\pi$.
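For completeness, the return $G_t$ used above and the corresponding Bellman expectation equation can be written out explicitly; these are the standard textbook forms assumed here rather than expressions quoted from the paper:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} T(s, a, s') \left[ R(s, a) + \gamma V^{\pi}(s') \right].$$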
Similarly, the value of being in a state $s$ and taking an action $a$ is given by the state-action value function $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$. Traditionally, actions executed by agents were single-step primitive actions, meaning that the agent resampled its policy at the next state $s'$ to determine the following action. Recent research has extended this action selection scheme by introducing an action duration, specifying how often the chosen action is repeated before the policy is queried again 60–63. In this way, the agent possesses an action selection policy $\pi_a : \mathcal{S} \to \mathcal{A}$ and an action repetition policy $\pi_j : \mathcal{S} \to \mathcal{J}$. Here, we follow the description in Ref. 61, where $j$ denotes the number of action repetitions and $\mathcal{J}$ is the set of allowed action repetition counts.
The same methods used to learn standard policies can be employed to discover the action repetition policy $\pi_j$. This paper uses a Q-learning-style update rule for learning the action repetition, as derived in Ref. 61; this approach is related to constructing temporally extended successor features. However, there are also methods for learning the repetition policy through policy gradient learning. The optimal repetition policy $\pi_j^*$ can be obtained by greedily selecting from the optimal $Q^{\pi_j^*}(s, j, a)$, where repeated-action values are derived from transitions in the environment according to

$$Q^{\pi_j}(s, j, a) = \mathbb{E}\left[\sum_{k=0}^{j-1} \gamma^{k} r_{t+k} + \gamma^{j} \max_{a', j'} Q^{\pi_j}(s_{t+j}, j', a') \;\middle|\; S_t = s,\ A_t = \dots = A_{t+j-1} = a\right]. \tag{1}$$
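The update above can be made concrete with a small tabular sketch. The following Python snippet is a minimal illustration of a Q-learning-style update for repeated actions; the skip-value table `Q_j`, the environment interface `env.step`, and the constants `ALPHA` and `GAMMA` are hypothetical names introduced for the example, not the paper's implementation.

```python
import numpy as np

ALPHA, GAMMA = 0.1, 0.95          # learning rate and discount factor (assumed values)
N_STATES, N_ACTIONS, MAX_REP = 25, 4, 5

# Tabular skip-value table: Q_j[s, j - 1, a] estimates the value of repeating action a
# for j consecutive steps starting from state s.
Q_j = np.zeros((N_STATES, MAX_REP, N_ACTIONS))

def update_repeated_action(env, s, a, j):
    """Execute action a for j consecutive steps and apply a Q-learning-style update."""
    discounted_reward, discount = 0.0, 1.0
    s_next, done = s, False
    for _ in range(j):
        s_next, r, done = env.step(a)          # hypothetical environment interface
        discounted_reward += discount * r
        discount *= GAMMA
        if done:
            break
    # Bootstrap from the best (repetition, action) pair in the reached state.
    target = discounted_reward
    if not done:
        target += discount * Q_j[s_next].max()
    Q_j[s, j - 1, a] += ALPHA * (target - Q_j[s, j - 1, a])
    return s_next, done
```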

Episodic control
Episodic memory draws from psychological and cognitive research on human memory 64,65 and follows the principles of instance-based decision theory 66. Extensive research has applied episodic memory to reinforcement learning to enhance sample efficiency. For instance, in Ref. 29, an episodic control method was proposed that uses episodic memory to store experiences, allowing the agent to replicate state-action sequences with high rewards. Gershman et al. 67 utilized episodic memory to construct context reinforcement learning for state value function estimation. Lin et al. 68 introduced a regularization term into the objective function to distill information from episodic memory into parametric models, significantly improving the performance of DQN (Deep Q-Network). Neural Episodic Control (NEC) 27 employs differentiable neural dictionaries to record slowly changing state-action keys alongside rapidly updated value estimates, and corrects its policy by looking up values from this episodic store. NEC adopts two techniques to enhance the scalability of the model. First, the number of elements involved in lookups is limited to the 50 nearest neighbors, which can be retrieved efficiently using a kd-tree. Second, the size of the differentiable neural dictionary (DND) is controlled by removing the least recently used items. The values saved in the episodic memory pool are N-step Q-value estimates,

$$Q^{(N)}(s_t, a_t) = \sum_{k=0}^{N-1} \gamma^{k} r_{t+k} + \gamma^{N} \max_{a'} Q(s_{t+N}, a').$$

During learning, Q-learning 69 is utilized to update the Q-values in the episodic memory pool for the keys already found in the DND. This process helps refine the Q-value estimates for the stored experiences, thereby contributing to the learning process.
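To make the DND mechanism concrete, the following is a minimal sketch of a differentiable-neural-dictionary-style memory for a single action, with kernel-weighted k-nearest-neighbor lookup and least-recently-used eviction. The class name, capacity, and brute-force neighbor search are illustrative assumptions rather than the paper's implementation (NEC retrieves neighbors with a kd-tree).

```python
import numpy as np

class SimpleDND:
    """Minimal episodic memory for one action: stores (key, value) pairs,
    answers kernel-weighted k-NN lookups, and evicts the least recently used entry."""

    def __init__(self, capacity=10000, k=50, delta=1e-3):
        self.capacity, self.k, self.delta = capacity, k, delta
        self.keys, self.values, self.last_used = [], [], []
        self.time = 0

    def kernel(self, query, key):
        # Inverse-distance kernel, as used in NEC.
        return 1.0 / (np.sum((query - key) ** 2) + self.delta)

    def lookup(self, query):
        """Return a kernel-weighted average of the values of the k nearest keys."""
        self.time += 1
        if not self.keys:
            return 0.0
        dists = [np.sum((query - key) ** 2) for key in self.keys]
        nearest = np.argsort(dists)[: self.k]
        weights = np.array([self.kernel(query, self.keys[i]) for i in nearest])
        vals = np.array([self.values[i] for i in nearest])
        for i in nearest:
            self.last_used[i] = self.time
        return float(np.dot(weights / weights.sum(), vals))

    def write(self, key, value, alpha=0.1):
        """Insert a new entry, or move an existing value toward the new estimate."""
        self.time += 1
        for i, stored in enumerate(self.keys):
            if np.allclose(stored, key):
                self.values[i] += alpha * (value - self.values[i])
                self.last_used[i] = self.time
                return
        if len(self.keys) >= self.capacity:            # evict least recently used
            lru = int(np.argmin(self.last_used))
            del self.keys[lru], self.values[lru], self.last_used[lru]
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(float(value))
        self.last_used.append(self.time)
```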

Successor features
Successor features (SF) are based on representing the value function in a way that separates reward information from the environment's state transition dynamics 57,58,70. Successor features assume that rewards are a linear combination of features $\phi_t = \phi(s_t, a_t, s_{t+1}) \in \mathbb{R}^n$, where this combination depends on state transitions and a weight vector $\omega \in \mathbb{R}^n$: $r_t = \phi_t^{\top} \omega$. Here, the features capture the information about states that is relevant for evaluating them under the reward function, in a low-dimensional representation. Consequently, the Q-function can be reformulated as

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{i=t}^{\infty} \gamma^{i-t} \phi_i \right]^{\top} \omega = \psi^{\pi}(s_t, a_t)^{\top} \omega,$$

where $\psi^{\pi}(s_t, a_t)$ represents the successor features of the state-action pair $(s_t, a_t)$ under policy $\pi$.
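As an illustration of how the two components of this decomposition can be learned separately, the sketch below updates a tabular successor-feature vector with a temporal-difference rule and fits the reward weights by a simple online regression step. The table layout and step sizes are assumptions made for the example, not details from the paper.

```python
import numpy as np

N_STATES, N_ACTIONS, N_FEATURES = 25, 4, 4
GAMMA, ALPHA_PSI, ALPHA_W = 0.95, 0.1, 0.05        # assumed hyperparameters

psi = np.zeros((N_STATES, N_ACTIONS, N_FEATURES))  # successor features psi(s, a)
w = np.zeros(N_FEATURES)                            # reward weights, r ~ phi . w

def sf_td_update(s, a, phi, s_next, a_next, done):
    """One-step TD update for successor features: psi(s,a) <- phi + gamma * psi(s',a')."""
    phi = np.asarray(phi, dtype=float)
    target = phi + (0.0 if done else GAMMA) * psi[s_next, a_next]
    psi[s, a] += ALPHA_PSI * (target - psi[s, a])

def reward_weight_update(phi, r):
    """Online regression of the reward weights from observed (phi, r) pairs."""
    global w
    phi = np.asarray(phi, dtype=float)
    w += ALPHA_W * (r - phi @ w) * phi

def q_value(s, a):
    """Recover the action value from the decomposition Q(s,a) = psi(s,a)^T w."""
    return psi[s, a] @ w
```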

Temporally extended successor feature neural episodic control
In this section, we introduce Temporally Extended Successor Feature Neural Episodic Control (TESFNEC), which extends SFNEC to learn temporally extended successor features $\psi \in \mathbb{R}^n$ in place of scalar state-action values. Like SFNEC, TESFNEC learns the temporally extended successor feature values for an action repeated $j$ times, $\psi^{N}_{j}$:

$$\psi^{N}_{j}(s, a, s', j) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k}\, \mathbb{I}\left[S_{t+k} = s'\right] \;\middle|\; S_t = s,\ A_t = \dots = A_{t+j-1} = a\right],$$

where $\mathbb{I}$ is the indicator function. Compared to the SFNEC method, the representation used in our proposed method is biased by the action space, revealing the extended transitions caused by repeatedly taking the same action. This results in representations of distant states that differ notably as the discount factor $\gamma$ changes.
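To illustrate what "temporally extended" means here, the sketch below accumulates discounted state-occupancy features while a single action is repeated $j$ times, which is one simple way to form a sample target for $\psi^{N}_{j}$. The environment interface and the one-hot feature map are assumptions made for the example.

```python
import numpy as np

GAMMA, N_STATES = 0.95, 25

def one_hot(s):
    """Indicator-style feature: a one-hot vector marking the visited state (assumed feature map)."""
    phi = np.zeros(N_STATES)
    phi[s] = 1.0
    return phi

def rollout_repeated_action(env, s, a, j, psi_bootstrap=None):
    """Repeat action a for j steps, accumulating discounted occupancy features.

    Returns a sample target for the temporally extended successor features of (s, a, j).
    """
    target = np.zeros(N_STATES)
    discount = 1.0
    s_cur, done = s, False
    for _ in range(j):
        s_cur, _, done = env.step(a)       # hypothetical environment interface
        target += discount * one_hot(s_cur)
        discount *= GAMMA
        if done:
            break
    # Optionally bootstrap from a stored estimate at the reached state.
    if psi_bootstrap is not None and not done:
        target += discount * psi_bootstrap(s_cur)
    return target
```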
To use SF for discovering action repetitions, the first step is to sample an action from the policy $\pi_a$, so that $\psi^{N}_{j}(s, a, s', j)$ is conditioned on two quantities: $s$ and $a$. The action repetition is then obtained by a max operation over $\psi^{N}_{j}(s, a, s', j)\, R$, where $R$ is the learned reward vector. Traditional successor features are learned through temporal difference learning, with state occupancy as the accumulated reward, and the reward weights $R$ are learned by online supervised learning from the expected cumulative rewards and the rewards actually received. Therefore, typical temporal difference learning rules can be used to learn temporally extended successor features:

$$\psi_{j}(s, a, j) \leftarrow \psi_{j}(s, a, j) + \alpha\left[\phi_t + \gamma\, \psi_{j}(s_{t+1}, a_{t+1}, j_{t+1}) - \psi_{j}(s, a, j)\right].$$

Moreover, to execute a lookup with the TESFNEC method, we adopt the following equation:

$$\psi_{j}(s_t, a) = \sum_{l} \frac{q(s_t, s_l)}{\sum_{m} q(s_t, s_m)}\, \psi_{j_l},$$

where $\psi_{j_l}$ corresponds to a previously stored $\psi_{j}$-value for state $s_l$ in episodic memory, and $q$ is the kernel used to compute a similarity score between the query state $s_t$ and the states $s_l$ in episodic memory. In this paper, we use the state vector $s_t$ as the key in the episodic memory in our experiments. Like SFNEC, we restrict the memory elements used during lookups to the nearest entries, e.g., the 50 nearest neighbors. Likewise, we employ the inverse distance kernel used in Ref. 27, $q(s_t, s_l) = 1 / (\lVert s_t - s_l \rVert_2^2 + \delta)$. During training, $\psi_{j}$ values are updated after observing $N$ transitions. When the $\psi_{j}$ value of a state-action pair $(s, a)$ does not exist in the episodic memory, the N-step estimate calculated using Equation (8) is inserted into the corresponding DND for action $a$. On the other hand, for $\psi_{j}$ values already in episodic memory, the following formula is used for updates:

$$\psi_{j_l} \leftarrow \psi_{j_l} + \alpha\left(\psi^{N}_{j}(s, a) - \psi_{j_l}\right),$$

where $\alpha$ is the learning rate.
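The greedy selection of an action and a repetition count described above can be sketched as follows, assuming one episodic memory per (action, repetition) pair whose `lookup` behaves like the DND sketched in the episodic control section, except that the stored values are successor-feature vectors. All names here are illustrative, not the paper's implementation.

```python
import numpy as np

def select_action_and_repetition(memories, s, R, n_actions, max_rep, epsilon=0.05):
    """Greedy selection over temporally extended successor features.

    memories[(a, j)] is an episodic memory whose lookup(s) returns the
    successor-feature vector psi_j(s, a); R is the learned reward vector.
    """
    if np.random.rand() < epsilon:                      # simple exploration (assumed)
        return np.random.randint(n_actions), np.random.randint(1, max_rep + 1)
    best, best_value = (0, 1), -np.inf
    for a in range(n_actions):
        for j in range(1, max_rep + 1):
            psi = memories[(a, j)].lookup(s)            # kernel-weighted k-NN estimate
            value = float(np.dot(psi, R))               # value of (s, a, j) is psi . R
            if value > best_value:
                best, best_value = (a, j), value
    return best
```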
The pseudo-code of Temporally Extended Successor Feature Neural Episodic Control (TESFNEC) is shown in Algorithm 1.

Experiment
The performance of the proposed TESFNEC method is evaluated in the two-dimensional object collection domain presented by Barreto et al. 57 (Fig. 1). This environment consists of four rooms, with the starting position located in the bottom-left corner, marked "S", and the target location in the top-right corner, marked "G". Within each room there are multiple objects belonging to three categories: circles, squares, and triangles. The objective is to navigate from the start to the target position while picking up objects so as to maximize the expected cumulative reward. The rewards are decomposed linearly into features and weights, $r_t = \phi_t^{\top} \omega$. These features describe, in binary form, the class of the collected object and whether the goal state is reached: $\phi \in \{0, 1\}^4$. Different tasks in the environment are defined by setting the weight vector $\omega$ accordingly.
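As a small illustration of how such a binary feature vector and task weights define the reward through the decomposition $r_t = \phi_t^{\top} \omega$, consider the sketch below; the exact feature ordering and weight values are assumptions for the example, not taken from the paper.

```python
import numpy as np

# Feature layout (assumed): [picked circle, picked square, picked triangle, reached goal]
def features(picked_class=None, reached_goal=False):
    phi = np.zeros(4)
    classes = {"circle": 0, "square": 1, "triangle": 2}
    if picked_class is not None:
        phi[classes[picked_class]] = 1.0
    if reached_goal:
        phi[3] = 1.0
    return phi

# Two example tasks: task A rewards circles, task B penalizes them (illustrative weights).
omega_task_a = np.array([1.0, 0.0, -1.0, 10.0])
omega_task_b = np.array([-1.0, 1.0, 0.0, 10.0])

phi_t = features(picked_class="circle")
print(phi_t @ omega_task_a)   #  1.0 under task A
print(phi_t @ omega_task_b)   # -1.0 under task B
```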
To evaluate transfer, the agent faces a sequence of tasks, each corresponding to a different instance of the weight vector $\omega$, and aims to maximize the cumulative sum of rewards. Overall, we adopt the same environment setting as in Barreto et al. 57. We compared the proposed TESFNEC method with the SFNEC, SFQL, and NEC methods. The comparison is based on the average return for each task across ten runs (Fig. 2). The TESFNEC method, which combines temporally extended successor features with episodic control, outperforms the SFNEC, SFQL, and NEC methods in terms of task performance. This is likely because TESFNEC leverages the learning speed of episodic control for each task and combines it with the flexibility and abstract representation capabilities provided by temporally extended successor features. Furthermore, we conduct experiments where the reward weight vector $\omega$ is not provided to the agents; rather, it is approximated while interacting with the environment for the methods that need $\omega$, i.e., SFNEC and TESFNEC. As shown in Fig. 3, we observe a reduction in the average return across all methods that rely on $\omega$.

Conclusion
In this paper, we introduce a novel approach that combines the flexibility of sample-efficient learning with the advantages of temporal abstraction using episodic control. Specifically, the model enhances agent training by incorporating episodic memory, facilitating the learning of superior policies. It leverages episodic memory to guide agent training, significantly reducing the number of iterations required to learn the optimal policy. During training, the model can dynamically retrieve historical high-reward information from episodic memory and seamlessly integrate it into the neural network, making more effective use of samples. Furthermore, we adopt the temporal extension of successor features as a technique to capture the expected state-transition dynamics of temporally extended actions. This form of action abstraction does not entail learning a top-down hierarchical task structure but focuses on the bottom-up combination of actions and action repetitions. As a result, it reduces the number of decisions necessary to learn the optimal policy without requiring hierarchical policy learning. Thus, our approach directly considers the temporal scope of sequences of temporally extended actions without requiring predefined or domain-specific options. Experimental results demonstrate that the presented method learns policies faster than baseline reinforcement learning methods, leading to higher average returns.

Figure 1. Two-dimensional object collection environment presented by Barreto et al. 57.

Figure 2. Comparison of the average reward obtained by different methods on the two-dimensional object collection environment.

Figure 3. Comparison of the average reward obtained by different methods on the two-dimensional object collection environment while learning the reward weight vector ω.